regression method
The Generalised Kernel Covariance Measure
Bergen, Luca, Sejdinovic, Dino, Didelez, Vanessa
We consider the problem of conditional independence (CI) testing and adopt a kernel-based approach. Kernel-based CI tests embed variables in reproducing kernel Hilbert spaces, regress their embeddings on the conditioning variables, and test the resulting residuals for marginal independence. This approach yields tests that are sensitive to a broad range of conditional dependencies. Existing methods, however, rely heavily on kernel ridge regression, which is computationally expensive when properly tuned and yields poorly calibrated tests when left untuned; this limits their practical usefulness. We propose the Generalised Kernel Covariance Measure (GKCM), a regression-model-agnostic kernel-based CI test that accommodates a broad class of regression estimators. Building on the Generalised Hilbertian Covariance Measure framework (Lundborg et al., 2022), we characterise conditions under which GKCM satisfies uniform asymptotic level guarantees. In simulations, GKCM paired with tree-based regression models frequently outperforms state-of-the-art CI tests across a diverse range of data-generating processes, achieving better type I error control and competitive or superior power.
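The regress-then-test-residuals idea described above can be illustrated with a much simpler scalar analogue. The sketch below is not the GKCM: it skips the kernel embeddings entirely, uses ordinary least squares as a stand-in regressor, and tests whether the product of the two residual series has mean zero using a normal approximation. The function names `residuals` and `covariance_measure_test` are hypothetical.

```python
import math
import numpy as np

def residuals(target, z):
    """Residuals of a least-squares fit of target on z (intercept included).

    Assumption: OLS stands in for whatever regression estimator is preferred.
    """
    design = np.column_stack([np.ones(len(z)), z])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

def covariance_measure_test(x, y, z):
    """Scalar residual-product test of 'X independent of Y given Z'.

    Returns the normalised statistic and a two-sided p-value from N(0, 1).
    """
    products = residuals(x, z) * residuals(y, z)
    n = len(products)
    stat = math.sqrt(n) * products.mean() / products.std()
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # equals 2 * (1 - Phi(|stat|))
    return stat, p_value
```

A call such as `covariance_measure_test(x, y, z)` with one-dimensional NumPy arrays returns a small p-value when the residuals of `x` and `y` remain correlated after `z` is regressed out. Because it only looks at residual correlation, this scalar version misses the nonlinear dependencies that kernel embeddings are meant to capture.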
Semi-Supervised Contrastive Learning for Deep Regression with Ordinal Rankings from Spectral Seriation
Contrastive learning methods can be applied to deep regression by enforcing label distance relationships in feature space. However, these methods are limited to labeled data, unlike in classification, where unlabeled data can be used for contrastive pretraining. In this work, we extend contrastive regression methods to allow unlabeled data to be used in a semi-supervised setting, thereby reducing the reliance on manual annotations. We observe that the feature similarity matrix between unlabeled samples still reflects inter-sample relationships, and that an accurate ordinal relationship can be recovered through spectral seriation algorithms if the level of error is within certain bounds. By using the recovered ordinal relationship for contrastive learning on unlabeled samples, we can allow more data to be used for feature representation learning, thereby achieving more robust results. The ordinal rankings can also be used to supervise predictions on unlabeled samples, which can serve as an additional training signal. We provide theoretical guarantees and empirical support through experiments on different datasets, demonstrating that our method can surpass existing state-of-the-art semi-supervised deep regression methods. To the best of our knowledge, this work is the first to explore using unlabeled data to perform contrastive learning for regression.
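As a rough illustration of the seriation step, a standard spectral approach sorts samples by the Fiedler vector (the eigenvector of the second-smallest eigenvalue) of the graph Laplacian built from the similarity matrix. The sketch below shows that generic technique only; whether the paper uses exactly this variant is an assumption, and the function name `spectral_seriation` is hypothetical.

```python
import numpy as np

def spectral_seriation(similarity):
    """Return an ordering of the samples from a symmetric similarity matrix.

    Sorts samples by the Fiedler vector of the unnormalised graph Laplacian.
    The ordering is only defined up to reversal.
    """
    s = np.asarray(similarity, dtype=float)
    laplacian = np.diag(s.sum(axis=1)) - s
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return np.argsort(fiedler)
```

When similarity decays with the distance between the underlying labels, the returned order ranks the samples by label, in either ascending or descending direction. The direction has to be fixed separately, for example from a few labeled samples.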
Penalized Fair Regression for Multiple Groups in Chronic Kidney Disease
Nakamoto, Carter H., Chen, Lucia Lushi, Foryciarz, Agata, Rose, Sherri
Fair regression methods have the potential to mitigate societal bias concerns in health care, but there has been little work on penalized fair regression when multiple groups experience such bias. We propose a general regression framework that addresses this gap with unfairness penalties for multiple groups. Our approach is demonstrated for binary outcomes with true positive rate disparity penalties. It can be efficiently implemented through reduction to a cost-sensitive classification problem. We additionally introduce novel score functions for automatically selecting penalty weights. Our penalized fair regression methods are empirically studied in simulations, where they achieve a fairness-accuracy frontier beyond that of existing comparison methods. Finally, we apply these methods to a national multi-site primary care study of chronic kidney disease to develop a fair classifier for end-stage renal disease. There we find substantial improvements in fairness for multiple race and ethnicity groups who experience societal bias in the health care system without any appreciable loss in overall fit.
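To make the idea of a true positive rate disparity penalty concrete, the sketch below computes one plausible form of it: a weighted sum, over groups, of the absolute gap between each group's true positive rate and the overall rate. The exact penalty used in the paper is not given in this abstract, so this form, the function names, and the `weights` argument are all assumptions.

```python
import numpy as np

def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives that are predicted positive."""
    positives = y_true == 1
    return float(np.mean(y_pred[positives] == 1))

def tpr_disparity_penalty(y_true, y_pred, groups, weights):
    """Weighted sum of |group TPR - overall TPR| over the groups in `weights`.

    Assumption: every listed group contains at least one positive case.
    """
    overall = true_positive_rate(y_true, y_pred)
    penalty = 0.0
    for group, weight in weights.items():
        mask = groups == group
        group_tpr = true_positive_rate(y_true[mask], y_pred[mask])
        penalty += weight * abs(group_tpr - overall)
    return penalty
```

In a penalized fit, a term like this would be added to the usual loss, with larger weights pushing the classifier toward equal true positive rates across the penalized groups.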
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The paper proposes a new regression method, namely calibrated multivariate regression (CMR), for high-dimensional data analysis. Besides proposing the CMR formulation, the paper focuses on (1) using a smoothed proximal gradient method to compute CMR's optimal solutions; (2) analyzing CMR's statistical properties. One key contribution of the paper lies in the introduction of this CMR formulation; its loss term can be interpreted as calibrating each regression task's loss term with respect to its noise level. I wonder whether there is a more intuitive interpretation of the use of the noise level for calibration.
Data-driven Discovery of Digital Twins in Biomedical Research
Métayer, Clémence, Ballesta, Annabelle, Martinelli, Julien
Recent technological advances have expanded the availability of high-throughput biological datasets, enabling the reliable design of digital twins of biomedical systems or patients. Such computational tools represent key reaction networks driving perturbation or drug response and can guide drug discovery and personalized therapeutics. Yet, their development still relies on laborious data integration by the human modeler, so that automated approaches are critically needed. The success of data-driven system discovery in Physics, rooted in clean datasets and well-defined governing laws, has fueled interest in applying similar techniques in Biology, which presents unique challenges. Here, we review methodologies for automatically inferring digital twins from biological time series, which mostly involve symbolic or sparse regression. We evaluate algorithms according to eight biological and methodological challenges, associated with noisy/incomplete data, multiple conditions, prior knowledge integration, latent variables, high dimensionality, unobserved variable derivatives, candidate library design, and uncertainty quantification. On these criteria, sparse regression generally outperforms symbolic regression, particularly when using Bayesian frameworks. We further highlight the emerging role of deep learning and large language models, which enable innovative prior knowledge integration, though the reliability and consistency of such approaches must be improved. While no single method addresses all challenges, we argue that progress in learning digital twins will come from hybrid and modular frameworks combining chemical reaction network-based mechanistic grounding, Bayesian uncertainty quantification, and the generative and knowledge integration capacities of deep learning. To support their development, we further propose a benchmarking framework to evaluate methods across all challenges.
Projection-based multifidelity linear regression for data-scarce applications
Sella, Vignesh, Pham, Julie, Willcox, Karen, Chaudhuri, Anirban
An important challenge in scientific machine learning is to develop methods that can exploit and maximize the amount of learning possible from scarce data [1-4]. The need for such methods arises often in science and engineering, especially in the case of computational fluid dynamics (CFD), since expensive-to-evaluate high-fidelity (HF) models make many-query problems such as uncertainty quantification, risk analysis, optimization, and optimization under uncertainty computationally prohibitive [5]. Surrogate models that approximate the solutions to HF models can facilitate the design and analysis process; however, a lack of sufficient HF data in tandem with high-dimensional quantities of interest adversely affects surrogate model accuracy. We propose multifidelity (MF) linear regression methods that leverage abundant low-cost, lower-fidelity (LF) data alongside limited HF data to construct linear regression models. These models operate within a reduced-dimensional subspace, obtained through principal component analysis (PCA), to effectively handle both training data scarcity and the high dimensionality (on the order of tens of thousands of quantities of interest) inherent in our problem setting. Linear regression has been widely utilized as a surrogate modeling approach in aerospace applications due to its simplicity and interpretability. We note that linear regression encompasses a broad class of models that are linear in their parameters but can include features that are arbitrarily nonlinear functions of the input variables [6].
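The projection step described above can be sketched for a single fidelity level: compress the high-dimensional outputs with PCA, fit a linear regression from the inputs to the reduced coordinates, and lift predictions back to the full output space. The sketch deliberately leaves out the multifidelity part (how LF and HF data are combined), which this excerpt does not specify; the function name `fit_pca_linear_surrogate` and its arguments are hypothetical.

```python
import numpy as np

def fit_pca_linear_surrogate(inputs, outputs, rank):
    """Fit a linear surrogate in a PCA subspace of the outputs.

    inputs:  (n_samples, n_inputs) array
    outputs: (n_samples, n_outputs) array, n_outputs possibly very large
    rank:    number of principal components to keep
    Returns a function mapping new inputs to full-dimensional predictions.
    """
    x_mean, y_mean = inputs.mean(axis=0), outputs.mean(axis=0)
    # Rows of vt are the principal directions of the centred outputs.
    _, _, vt = np.linalg.svd(outputs - y_mean, full_matrices=False)
    basis = vt[:rank].T                       # (n_outputs, rank)
    reduced = (outputs - y_mean) @ basis      # (n_samples, rank)
    design = np.column_stack([np.ones(len(inputs)), inputs - x_mean])
    coef, *_ = np.linalg.lstsq(design, reduced, rcond=None)

    def predict(new_inputs):
        new_design = np.column_stack(
            [np.ones(len(new_inputs)), new_inputs - x_mean]
        )
        # Predict reduced coordinates, then lift back to the output space.
        return (new_design @ coef) @ basis.T + y_mean

    return predict
```

The regression only has to estimate `rank` responses instead of tens of thousands, which is what makes the fit feasible with few training samples.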
SplitWise Regression: Stepwise Modeling with Adaptive Dummy Encoding
Kurbucz, Marcell T., Tzivanakis, Nikolaos, Aslam, Nilufer Sari, Sykulski, Adam M.
Capturing nonlinear relationships without sacrificing interpretability remains a persistent challenge in regression modeling. We introduce SplitWise, a novel framework that enhances stepwise regression. It adaptively transforms numeric predictors into threshold-based binary features using shallow decision trees, but only when such transformations improve model fit, as assessed by the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). This approach preserves the transparency of linear models while flexibly capturing nonlinear effects. Implemented as a user-friendly R package, SplitWise is evaluated on both synthetic and real-world datasets. The results show that it consistently produces more parsimonious and generalizable models than traditional stepwise and penalized regression techniques.
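The core decision described above, replacing a numeric predictor with a threshold dummy only when an information criterion improves, can be sketched for a single predictor. This is a simplified illustration, not the SplitWise package: the threshold comes from a depth-one split chosen by squared error, the AIC assumes Gaussian errors and ignores the cost of searching for the threshold, and all function names are hypothetical.

```python
import numpy as np

def gaussian_aic(y, design):
    """AIC of a least-squares fit, up to an additive constant."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    n, k = design.shape
    return n * np.log(rss / n) + 2 * (k + 1)  # +1 for the error variance

def best_stump_threshold(x, y):
    """Threshold of the depth-one split that minimises squared error.

    Assumption: x has at least two distinct values.
    """
    values = np.unique(x)                      # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2

    def sse(threshold):
        left, right = y[x <= threshold], y[x > threshold]
        return np.sum((left - left.mean()) ** 2) + np.sum(
            (right - right.mean()) ** 2
        )

    return float(min(candidates, key=sse))

def encode_if_better(x, y):
    """Return ('dummy', threshold) if the binary feature lowers AIC,
    otherwise ('linear', None)."""
    ones = np.ones(len(x))
    threshold = best_stump_threshold(x, y)
    dummy = (x > threshold).astype(float)
    aic_linear = gaussian_aic(y, np.column_stack([ones, x]))
    aic_dummy = gaussian_aic(y, np.column_stack([ones, dummy]))
    if aic_dummy < aic_linear:
        return "dummy", threshold
    return "linear", None
```

Swapping `2 * (k + 1)` for `np.log(n) * (k + 1)` turns the criterion into BIC, which penalizes extra terms more heavily for larger samples.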